Data Wrangling 
with dplyr and tidyr 
Cheat Sheet 


Tidy Data -A foundation for wrangling in R 


, m Tidy data complements R’s vectorized M ji A| ; F 
In a tidy operations. R will automatically preserve i 
data set: observations as you manipulate variables. ——p 
Each variable is saved Each observationis No other format works as intuitively with R. M x A 


saved in its own row 


Reshaping Data -Change the layout of a data set 


in its own column 


dplyr::tbl_df(iris) 
Converts data to tbl class. tbl’s are easier to examine than 
data frames. R displays only the data that fits onscreen: 


Source: local data frame [150 x 5] 


Variables not shown: Petal.Width (dbl), 
Species (fctr) 


dplyr::glimpse(iris) 
Information dense summary of tbl data. 
utils: View(iris) 
View data set in spreadsheet-like display (note capital V). 


\_jiris x =f | 
&\| Y Filter Q 
Sepal.Length Sepal.Width Petal.Length Petal.Width Species 

5.1 3s 1.4 0.2 setosa | 
4.9 3.0 1.4 0.2 setosa | 
4.7 3.2 1.3 0.2 setosa 

4.6 3:1 LS 0.2 setosa | 
5.0 3.6 1.4 0.2 setosa | 
5.4 3.9 EF 0.4 setosa | 
4.6 3.4 1.4 0.3 setosa | 


5.0 3.4 1.5 0.2 setosa 
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dplyr::%>% 
Passes object on left hand side as first argument (or. 
argument) of function on righthand side. 


x %>% f(y) isthesameas f(x, y) 
y %>% f(x, ., z) isthesameas f(x, y, z ) 


"Piping" with %>% makes code more readable, e.g. 


iris %>% 
group_by(Species) %>% 
summarise(avg = mean(Sepal.Width) ) %>% 
arrange(avg) 
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DD — | — | ibm dplyr::data_frame(a = 1:3, b = 4:6) 
| EE | = -> =] ‘ À 
= = Combine vectors into data frame 
—— —— (optimized). 
— — dplyr::arrange(mtcars, mpg) 


Order rows by values of a column 

(low to high). 
dplyr::arrange(mtcars, desc(mpg)) 

Order rows by values of a column 


tidyr::gather(cases, "year", "n", 2:4) 
Gather columns into rows. 


tidyr::spread(pollution, size, amount) 
Spread rows into columns. 


_ En _ En | 
=— Bae EHE -> = (high to low). 
ea Hoo mma = 


dplyr:rename(tb, y = year) 
Rename the columns of a data 
frame. 


tidyr::separate(storms, date, c("y", "m", "d")) 
Separate one column into several. 


tidyr::unite(data, col, ..., sep) 
Unite several columns into one. 


Subset Variables (Columns) 


Subset Observations (Rows) 


cae 
> Bee == = 
aS faa ae 
Saas 
dplyr::filter(iris, Sepal.Length > 7) 
Extract rows that meet logical criteria. 
dplyr::distinct(iris) 
Remove duplicate rows. 


dplyr::select(iris, Sepal.Width, Petal.Length, Species) 
Select columns by name or helper function. 


` Helper functions for select - ?select 
select(iris, contains(".")) 


dplyr::sample_frac(iris, 0.5, replace = TRUE) Select columns whose name contains a character string. 
Randomly select fraction of rows. oe sar nik ie a 
Secs Select columns whose name ends with a character string. 
dplyr::sample_n(iris, 10, replace = TRUE) ra ivernih & 
Randomly select n rows. Select every column. 
dplyr::slice(iris, 10:15) select(iris, matches(".t.")) 
Select rows by position. Select columns whose name matches a regular expression. 
select(iris, num_range("x", 1:5)) 
dplyr:top_n(storms, 2; date) Select columns named x1, x2, x3, x4, x5. 
Select and order top n entries (by group if grouped data). select(iris, one_of(c("Species", "Genus"))) 


— 5 . 5 i . Select columns whose names are in a group of names. 
Logic in R - ?Comparison, ?base::Logic select(iris, starts_with("Sepal")) 


< Less than l= 


Not equal to Select columns whose name starts with a character string. 
> Greater than win% Group membership select(iris, Sepal.Length:Petal.Width) 
== Equal to is.na Is NA Select all columns between Sepal.Length and Petal.Width (inclusive). 
<= | Less than or equal to !is.na Is not NA select(iris, -Species) 
>= (Greaterthanorequalto &,|,!,xor,any,all Boolean operators Select all columns except Species. 
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Make New Variables Combine while Sets 


ESS aa >m ti 


dplyr::summarise(iris, avg = mean(Sepal.Length)) 
Summarise data into single row of values. 
dplyr::ssummarise_each(iris, funs(mean)) 
Apply summary function to each column. 
dplyr::count(iris, Species, wt = Sepal.Length) 
Count number of rows with each unique value of 
variable (with or without weights). 


i> - 


Summarise uses summary functions, functions that 
take a vector of values and return a single value, such as: 


dplyr::first min 

First value of a vector. Minimum value in a vector. 
dplyr::last max 

Last value of a vector. Maximum value in a vector. 
dplyr::nth mean 

Nth value of a vector. Mean value of a vector. 
dplyr::n median 

# of values in a vector. Median value of a vector. 
dplyr::n_distinct var 


# of distinct values in Variance of a vector. 
a vector. sd 
IQR 


IQR of a vector. 


Standard deviation of a 
vector. 


Group Data 


dplyr::group_by(iris, Species) 

Group data into rows with the same value of Species. 
dplyr::ungroup(iris) 

Remove grouping information from data frame. 


iris %>% group_by(Species) %>% summarise(...) 
Compute separate summary row for each group. 


-> 
| 
a | a | ee | Beis 
| | | | 
ee | 
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dplyr::mutate(iris, sepal = Sepal.Length + Sepal. Width) 
Compute and append one or more new columns. 
dplyr::mutate_each(iris, funs(min_rank)) 
Apply window function to each column. 
dplyr::transmute(iris, sepal = Sepal.Length + Sepal. Width) 
Compute one or more new columns. Drop original columns. 


3 O 
window — 
function = 

OO 


Mutate uses window functions, functions that take a vector of 
values and return another vector of values, such as: 


dplyr::lead dplyr::cumall 
Copy with values shifted by1. Cumulativeall 
dplyr::lag dplyr::;cumany 


Cumulative any 
dplyr::cummean 
Cumulative mean 


Copy with values lagged by 1. 
dplyr::dense_rank 
Ranks with no gaps. 


dplyr::min_rank cumsum 

Ranks. Ties get min rank. Cumulative Sum 
dplyr::percent_rank cummax 

Ranks rescaled to [0, 1]. Cumulative max 
dplyr::row_number cummin 

Ranks. Ties got to first value. Cumulative min 
dplyr::ntile cumprod 

Bin vector into n buckets. Cumulative prod 
dplyr::between pmax 

Are values between a and b? Element-wise max 
dplyr::;cume_dist pmin 


Cumulative distribution. Element-wise min 


iris %>% group_by(Species) %>% mutate(...) 
Compute new variables by group. 
aaa eae | | a ee | | | [jj]. 
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Mutating Joins 


aet dplyr::left_join(a, b, by = "x1") 
B 2 GFE Join matching rows from bto a. 
Leif a dplyr::right_join(a, b, by = "x1") 
—— 2 Join matching rows from a to b. 
arhe he dplyr::inner_join(a, b, by = "x1") 
B 2 B Join data. Retain only rows in both sets. 
FARARS dplyr::full_join(a, b, by = "x1") 
B 2 — Join data. Retain all values, all rows. 
Di na M 


Filtering Joins 


Ee dplyr::semi_join(a, b, by ="x1") 

E 2 All rows in a that have a match in b. 
Emez dplyr::anti_join(a, b, by = "x1") 

Cc B 


All rows in a that do not have a match in b. 


Set Operations 


EEE dplyrintersect(y, z) 


—_— Rows that appear in both y and z. 
sec | x2 | 

—— dplyr::union(y, z) 
aan ee Rows that appear in either or both y and z. 
ENE dplyr::setdiff(y, z) 


Rows that appear in y but not z. 


Binding 


mmea 
BNE dplyr::bind_rows(y, z) 
—— Append z to y as new rows. 
o i 
ers dplyr::bind_cols(y, z) 

AUDA) Append z to y as new columns. 

B 2 Ei a 

c 3 piy Caution: matches rows by position. 
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